Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Dec 21, 2025

closes #115

The is_subsequence_in_list function was using slice-based matching, which only detected contiguous subsequences. This caused GSP to miss valid patterns like ('a', 'c') in sequence ['a', 'b', 'c'], violating standard GSP algorithm semantics.

Changes

Core algorithm (gsppy/utils.py):

  • Replaced slice-based contiguous matching with two-pointer ordered subsequence detection
  • Maintains O(n) time complexity while correctly identifying non-contiguous patterns

Tests:

  • Updated test_is_subsequence_in_list to verify ordered (non-contiguous) matching behavior
  • Updated test_frequent_patterns expected values to reflect additional patterns now detected
  • Added test_non_contiguous_subsequences covering the reported case
  • Added test_contiguous_vs_non_contiguous_patterns demonstrating patterns found in both modes vs only non-contiguous
  • Added test_non_contiguous_with_longer_gaps testing matching with gaps of varying sizes
  • Added test_order_sensitivity verifying order requirements are enforced
  • Added test_is_subsequence_contiguous_vs_non_contiguous documenting matching differences at utils level
  • Added test_is_subsequence_with_gaps testing gap sizes from 1 to 8+ elements

Documentation:

  • Updated module and function docstrings to clarify ordered (non-contiguous) matching
  • Updated README.md with comprehensive documentation:
    • Added non-contiguous matching to Key Features section with clear examples
    • Updated sample output to show actual patterns detected (including non-contiguous patterns)
    • Added new "Understanding Non-Contiguous Pattern Matching" section with practical examples
    • Clarified throughout that patterns are matched in order, not necessarily contiguously

Example

from gsppy.gsp import GSP

sequences = [
    ['a', 'b', 'c'],
    ['a', 'c'],
    ['b', 'c', 'a'],
    ['a', 'b', 'c', 'd'],
]

gsp = GSP(sequences)
result = gsp.search(min_support=0.5)

# Before: ('a', 'c') not detected
# After: ('a', 'c') correctly found with support=3

Test Coverage

The comprehensive test suite clearly demonstrates:

  • Patterns found in both contiguous and non-contiguous modes: e.g., ('X', 'Y') in ['X', 'Y', 'Z']
  • Patterns ONLY found in non-contiguous mode: e.g., ('X', 'Z') in ['X', 'Y', 'Z']
  • Gap tolerance: Patterns are found with gaps of various sizes (1 to 8+ elements)
  • Order sensitivity: Wrong order patterns like ('E', 'A') in ['A', 'E'] are correctly rejected

All 44 tests pass (39 original + 5 new comprehensive tests).

Original prompt

develop this issue: #115


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

- Updated is_subsequence_in_list to use two-pointer approach for ordered matching
- Modified tests to reflect correct GSP behavior for non-contiguous patterns
- Added test_non_contiguous_subsequences to verify the fix
- Updated module docstrings to reflect the change

Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix issue 115 related to data processing Fix GSP to detect non-contiguous ordered subsequences Dec 21, 2025
@jacksonpradolima
Copy link
Owner

@copilot This was the previous try to address this issue: #116
is there something we can reuse?

Copy link
Contributor Author

Copilot AI commented Dec 21, 2025

@copilot This was the previous try to address this issue: #116
is there something we can reuse?

I reviewed PR #116. The main difference is the approach:

PR #116: Added a contiguous parameter (default=False) to allow users to choose between contiguous and non-contiguous matching modes. This provides flexibility but adds API complexity.

Current PR: Modified the core algorithm to always use non-contiguous matching, aligning with canonical GSP semantics. This is simpler and fixes the reported bug directly.

Both approaches are valid. PR #116 offers more flexibility, while the current approach is simpler and follows the principle that GSP should detect ordered subsequences by default (as stated in the issue).

If you prefer the parameterized approach from PR #116, I can adapt that implementation. However, I'd recommend clarifying in the issue whether:

  1. GSP should always use non-contiguous (standard semantics) - current PR
  2. GSP should support both modes via parameter - PR Addition of non-contiguous search and parameterization #116 approach

The current implementation passes all tests and correctly detects patterns like ('a', 'c') in ['a', 'b', 'c'].

@jacksonpradolima
Copy link
Owner

@copilot Ok, should we add more tests to show the differences (contiguous and non-continuous) working?

…tterns

- Added test_contiguous_vs_non_contiguous_patterns to show patterns found in both modes
- Added test_non_contiguous_with_longer_gaps to test matching with large gaps
- Added test_order_sensitivity to verify order requirements
- Added test_is_subsequence_contiguous_vs_non_contiguous in utils tests
- Added test_is_subsequence_with_gaps to test various gap sizes
- All 44 tests pass (5 new tests added)

Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Dec 21, 2025

@copilot Ok, should we add more tests to show the differences (contiguous and non-continuous) working?

Added 5 comprehensive tests demonstrating the differences between contiguous and non-contiguous pattern matching (commit 860869c):

GSP-level tests (test_gsp.py):

  • test_contiguous_vs_non_contiguous_patterns - Shows patterns found in both modes vs only non-contiguous
  • test_non_contiguous_with_longer_gaps - Tests matching with gaps of 1-3+ elements
  • test_order_sensitivity - Verifies order requirements are enforced

Utils-level tests (test_utils.py):

  • test_is_subsequence_contiguous_vs_non_contiguous - Documents matching differences
  • test_is_subsequence_with_gaps - Tests gap sizes from 1 to 8+ elements

All 44 tests pass. The tests clearly demonstrate:

  • Patterns that appear contiguously work in both modes: ('X', 'Y') in ['X', 'Y', 'Z']
  • Patterns with gaps only work in non-contiguous mode: ('X', 'Z') in ['X', 'Y', 'Z']
  • Order is always enforced: ('c', 'a') doesn't match in ['a', 'b', 'c']

@jacksonpradolima
Copy link
Owner

@copilot, all is working, but the README isn't updated properly.

- Added ordered (non-contiguous) matching to Key Features section
- Updated sample output to reflect additional patterns detected
- Added detailed "Understanding Non-Contiguous Pattern Matching" section with example
- Updated notes to clarify patterns are matched in order, not contiguously
- Clarified examples to explain gap tolerance in pattern matching

Co-authored-by: jacksonpradolima <7774063+jacksonpradolima@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Dec 21, 2025

@copilot, all is working, but the README isn't updated properly.

Updated README.md with comprehensive documentation (commit 0501c91):

Key Features section: Added non-contiguous matching as the first feature with example

Sample Output: Updated to show the actual patterns detected with non-contiguous matching (e.g., ('Bread', 'Beer') now appears)

New section "Understanding Non-Contiguous Pattern Matching": Added with:

  • Clear explanation of ordered vs contiguous matching
  • Visual example showing which patterns are found
  • Practical demonstration using ['a', 'b', 'c'] sequences

Updated notes: Clarified that patterns are matched "in order (not necessarily contiguously)" throughout the documentation

@sonarqubecloud
Copy link

@jacksonpradolima jacksonpradolima marked this pull request as ready for review December 21, 2025 21:25
Copilot AI review requested due to automatic review settings December 21, 2025 21:25
@jacksonpradolima jacksonpradolima merged commit 8367b3f into master Dec 21, 2025
14 checks passed
@jacksonpradolima jacksonpradolima deleted the copilot/fix-issue-115 branch December 21, 2025 21:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical bug in the GSP algorithm where the is_subsequence_in_list function was using slice-based contiguous matching instead of ordered (non-contiguous) subsequence detection. This caused the algorithm to miss valid patterns like ('a', 'c') in sequence ['a', 'b', 'c'], violating standard GSP semantics.

Key Changes:

  • Replaced slice-based contiguous matching with a two-pointer ordered subsequence detection algorithm
  • Updated test expectations to reflect additional non-contiguous patterns now correctly detected
  • Added comprehensive test coverage for non-contiguous pattern matching scenarios
  • Updated documentation to clearly explain ordered (non-contiguous) matching behavior

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
gsppy/utils.py Replaced slice-based contiguous matching with two-pointer algorithm for ordered (non-contiguous) subsequence detection while maintaining O(n) time complexity
tests/test_utils.py Updated and expanded tests to verify ordered matching behavior, including tests for gaps of various sizes and order sensitivity
tests/test_gsp.py Updated expected values to reflect additional non-contiguous patterns and added 5 comprehensive tests demonstrating pattern detection with gaps
README.md Updated documentation with clear explanations of non-contiguous matching, practical examples, and a new dedicated section explaining the matching behavior

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GSP fails to detect non-contiguous subsequences (e.g., ('a','c')) due to contiguous matcher

2 participants